Fundamental Techniques in Data Science with R

Today

This lecture

  • Merging Data / Data Table

  • Linear models

  • Robust linear modeling

  • Logistic models

  • Visualizing marginal effects

  • Time series in R

  • Principal components analysis

  • Clustering

Packages used

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(magrittr)
library(DAAG)
## Loading required package: lattice
library(mice)
## 
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(bench)
## 
## Attaching package: 'bench'
## The following object is masked from 'package:DAAG':
## 
##     press
library(ggplot2)  # Plotting suite
library(class)    # K-nearest Neighbour
library(mvtnorm)  # Multivariate Normal tools
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(caret)

Joining data in base R

Merging data sets

Let’s consider the following data example:

df1 <- data.frame(id = c(1:6), 
                  city = c(rep("Utrecht", 3), rep("Rotterdam", 3)))
df2 <- data.frame(ID = c(2, 4, 6), 
                  year = c(1984, 1982, 1990))
head(df1)
##   id      city
## 1  1   Utrecht
## 2  2   Utrecht
## 3  3   Utrecht
## 4  4 Rotterdam
## 5  5 Rotterdam
## 6  6 Rotterdam
head(df2)
##   ID year
## 1  2 1984
## 2  4 1982
## 3  6 1990

Inner join

merge(x = df1, y = df2, by.x = "id", by.y = "ID")
##   id      city year
## 1  2   Utrecht 1984
## 2  4 Rotterdam 1982
## 3  6 Rotterdam 1990

This is a natural join, or inner join. Only the cases from df1 that have a match in df2 are returned.

Outer join

merge(x = df1, y = df2, by.x = "id", by.y = "ID", all = TRUE)
##   id      city year
## 1  1   Utrecht   NA
## 2  2   Utrecht 1984
## 3  3   Utrecht   NA
## 4  4 Rotterdam 1982
## 5  5 Rotterdam   NA
## 6  6 Rotterdam 1990

All rows from both data frames are returned; cases without a match get NA in the columns from the other data frame.

Outer join, the other way around

merge(y = df1, x = df2, by.y = "id", by.x = "ID", all = TRUE)
##   ID year      city
## 1  1   NA   Utrecht
## 2  2 1984   Utrecht
## 3  3   NA   Utrecht
## 4  4 1982 Rotterdam
## 5  5   NA Rotterdam
## 6  6 1990 Rotterdam

Left outer join

merge(x = df1, y = df2, by.x = "id", by.y = "ID", all.x = TRUE)
##   id      city year
## 1  1   Utrecht   NA
## 2  2   Utrecht 1984
## 3  3   Utrecht   NA
## 4  4 Rotterdam 1982
## 5  5 Rotterdam   NA
## 6  6 Rotterdam 1990

Left outer join, the other way around

merge(y = df1, x = df2, by.y = "id", by.x = "ID", all.x = TRUE)
##   ID year      city
## 1  2 1984   Utrecht
## 2  4 1982 Rotterdam
## 3  6 1990 Rotterdam

Right outer join

merge(x = df1, y = df2, by.x = "id", by.y = "ID", all.y = TRUE)
##   id      city year
## 1  2   Utrecht 1984
## 2  4 Rotterdam 1982
## 3  6 Rotterdam 1990

Joining data with dplyr

Data sets

band_members
## # A tibble: 3 x 2
##   name  band   
##   <chr> <chr>  
## 1 Mick  Stones 
## 2 John  Beatles
## 3 Paul  Beatles
band_instruments
## # A tibble: 3 x 2
##   name  plays 
##   <chr> <chr> 
## 1 John  guitar
## 2 Paul  bass  
## 3 Keith guitar

inner join

band_members %>% inner_join(band_instruments)
## Joining, by = "name"
## # A tibble: 2 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 John  Beatles guitar
## 2 Paul  Beatles bass

dplyr::inner_join(x, y) returns all rows from x where there are matching values in y, and all columns from x and y.

  • If there are multiple matches between x and y, all combinations of the matches are returned.
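The multiple-match behaviour can be illustrated with a small made-up example (the data frames x and y below are hypothetical, not from the band data):

```r
library(dplyr)

# hypothetical data frames: the key "a" occurs twice in both x and y
x <- data.frame(k = c("a", "a", "b"), xval = 1:3)
y <- data.frame(k = c("a", "a", "c"), yval = 4:6)

joined <- inner_join(x, y, by = "k")
joined
# each "a" row in x matches both "a" rows in y,
# so the result contains 2 * 2 = 4 rows for key "a"
```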

left join

band_members %>% left_join(band_instruments)
## Joining, by = "name"
## # A tibble: 3 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass

dplyr::left_join(x, y) returns all rows from x, and all columns from x and y.

  • Rows in x with no match in y will have NA values in the new columns.
  • If there are multiple matches between x and y, all combinations of the matches are returned.

right join

band_members %>% right_join(band_instruments)
## Joining, by = "name"
## # A tibble: 3 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 John  Beatles guitar
## 2 Paul  Beatles bass  
## 3 Keith <NA>    guitar

dplyr::right_join(x, y) returns all rows from y, and all columns from x and y.

  • Rows in y with no match in x will have NA values in the new columns.
  • If there are multiple matches between x and y, all combinations of the matches are returned.

full join

band_members %>% full_join(band_instruments)
## Joining, by = "name"
## # A tibble: 4 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass  
## 4 Keith <NA>    guitar

dplyr::full_join(x, y) returns all rows and all columns from both x and y. Where there is no matching value, NA is returned for the missing side.

Filtering joins: semi

  • dplyr::semi_join(x, y)
band_members %>% semi_join(band_instruments, by = "name")
## # A tibble: 2 x 2
##   name  band   
##   <chr> <chr>  
## 1 John  Beatles
## 2 Paul  Beatles
  • dplyr::anti_join(x, y)
band_members %>% anti_join(band_instruments, by = "name")
## # A tibble: 1 x 2
##   name  band  
##   <chr> <chr> 
## 1 Mick  Stones

nest join

band_members %>% nest_join(band_instruments, by = "name")
## # A tibble: 3 x 3
##   name  band    band_instruments
## * <chr> <chr>   <list>          
## 1 Mick  Stones  <tibble [0 × 1]>
## 2 John  Beatles <tibble [1 × 1]>
## 3 Paul  Beatles <tibble [1 × 1]>

Tibbles

a <- band_members %>% nest_join(band_instruments)
## Joining, by = "name"
a$band_instruments
## [[1]]
## # A tibble: 0 x 1
## # … with 1 variable: plays <chr>
## 
## [[2]]
## # A tibble: 1 x 1
##   plays 
##   <chr> 
## 1 guitar
## 
## [[3]]
## # A tibble: 1 x 1
##   plays
##   <chr>
## 1 bass

Tibbles

Tibbles are a modern take on data frames with a couple of improvements. For example, tibbles can be constructed much faster, as the benchmark below shows:

set.seed(123)
l <- replicate(26, sample(1000), simplify = FALSE)
names(l) <- letters

timing <- bench::mark(
  as_tibble(l),
  as.data.frame(l),
  check = FALSE)

timing
## # A tibble: 2 x 6
##   expression            min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 as_tibble(l)        252µs 446.73µs     2157.   189.9KB     4.29
## 2 as.data.frame(l)    996µs   1.77ms      552.    4.91KB     4.12

Naming with tibble

a <- data.frame(abc = 1)
a$a
## [1] 1
a <- tibble(abc = 1)
a$a
## Warning: Unknown or uninitialised column: 'a'.
## NULL
a$abc
## [1] 1

Linear models

The linear model

Notation

The mathematical formulation of the relationship between variables can be written as

\[ \mbox{observed}=\mbox{predicted}+\mbox{error} \]

or, in Greek notation, \[y=\mu+\varepsilon\]

where

  • \(\mu\) (the mean) is the part of the outcome that is explained by the model
  • \(\varepsilon\) (the residual) is the part of the outcome that is not explained by the model

Univariate expectation

Conditional expectation

Assumptions

The key assumptions

There are four key assumptions about the use of linear regression models.

In short, we assume

  • The outcome to have a linear relation with the predictors and the predictor relations to be additive.
    • the expected value for the outcome is a straight-line function of each predictor, given that the others are fixed.
    • the slope of each line does not depend on the values of the other predictors
    • the effects of the predictors on the expected value are additive

    \[ y = \alpha + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \epsilon\]

  • The residuals are statistically independent
  • The residual variance is constant
    • across the expected values
    • across any of the predictors
  • The residuals are normally distributed with mean \(\mu_\epsilon = 0\)

A simple model

fit <- anscombe %$%
  lm(y1 ~ x1)
fit
## 
## Call:
## lm(formula = y1 ~ x1)
## 
## Coefficients:
## (Intercept)           x1  
##      3.0001       0.5001
fit2 <- anscombe %$%
  lm(y2 ~ x2)

Visualizing the assumptions

par(mfrow = c(2, 2))
plot(fit)

Visualizing the assumptions

par(mfrow = c(2, 2))
plot(fit2)

Model fit

A simple model

boys.fit <- 
  na.omit(boys) %$% # Extremely wasteful
  lm(age ~ reg)
boys.fit
## 
## Call:
## lm(formula = age ~ reg)
## 
## Coefficients:
## (Intercept)      regeast      regwest     regsouth      regcity  
##     14.7109      -0.8168      -0.7044      -0.6913      -0.6663
boys %>% na.omit(boys) %$% aggregate(age, list(reg), mean)
##   Group.1        x
## 1   north 14.71094
## 2    east 13.89410
## 3    west 14.00657
## 4   south 14.01965
## 5    city 14.04460

Plotting the model

means <- boys %>% na.omit(boys) %>% group_by(reg) %>% summarise(age = mean(age))
ggplot(na.omit(boys), aes(x = reg, y = age)) + 
  geom_point(color = "grey") + 
  geom_point(data = means, stat = "identity", size = 3)

Model parameters

boys.fit %>%
  summary()
## 
## Call:
## lm(formula = age ~ reg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8519 -2.5301  0.0254  2.2274  6.2614 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.7109     0.5660  25.993   <2e-16 ***
## regeast      -0.8168     0.7150  -1.142    0.255    
## regwest      -0.7044     0.6970  -1.011    0.313    
## regsouth     -0.6913     0.6970  -0.992    0.322    
## regcity      -0.6663     0.9038  -0.737    0.462    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.151 on 218 degrees of freedom
## Multiple R-squared:  0.006745,   Adjusted R-squared:  -0.01148 
## F-statistic: 0.3701 on 4 and 218 DF,  p-value: 0.8298

Scientific notation

If you have trouble reading scientific notation, 2e-16 means the following

\[2\text{e-16} = 2 \times 10^{-16} = 2 \times \left(\frac{1}{10}\right)^{16}\]

This indicates that the decimal point should be moved 16 places to the left:

\[2\text{e-16} = 0.0000000000000002\]
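In R we can verify this by printing the number in fixed notation:

```r
# 2e-16 written out in fixed notation: the decimal point moves 16 places left
sprintf("%.16f", 2e-16)
# "0.0000000000000002"
```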

Is it a good model?

boys.fit %>%
  anova()
## Analysis of Variance Table
## 
## Response: age
##            Df Sum Sq Mean Sq F value Pr(>F)
## reg         4   14.7  3.6747  0.3701 0.8298
## Residuals 218 2164.6  9.9293

It is not a very informative model. The anova is not significant, indicating that reg does not explain a meaningful part of the variance in age.

The outcome age does not change significantly when reg is varied.

AIC

Akaike’s An Information Criterion

boys.fit %>% 
  AIC()
## [1] 1151.684

What is AIC?

AIC comes from information theory and can be used for model selection. The AIC estimates the information that is lost when the statistical model is used to approximate the process that generated the data. In other words: AIC measures the (relative) fit of the model to the data.

  • The better the fit, the less the loss in information
  • AIC works on the log scale:
    • \(\text{log}(0) = -\infty\), \(\text{log}(1) = 0\), etc.
  • the lower the AIC, the better the model fits the data
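Under the hood, AIC is \(-2\log L + 2k\), where \(k\) is the number of estimated parameters. A quick sketch verifying this against AIC(), using the anscombe fit from earlier:

```r
fit <- lm(y1 ~ x1, data = anscombe)

ll <- logLik(fit)            # maximised log-likelihood of the model
k  <- attr(ll, "df")         # number of estimated parameters (coefficients + sigma)
manual_aic <- -2 * as.numeric(ll) + 2 * k

all.equal(manual_aic, AIC(fit))   # TRUE
```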

Model comparison

A new model

Let’s add predictor hgt to the model:

boys.fit2 <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt)

boys.fit %>% AIC()
## [1] 1151.684
boys.fit2 %>% AIC()
## [1] 836.3545

Another model

Let’s add wgt to the model

boys.fit3 <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt + wgt)

And another model

Let’s add wgt and the interaction between wgt and hgt to the model

boys.fit4 <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt * wgt)

is equivalent to

boys.fit4 <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt + wgt + hgt:wgt)

Model comparison

AIC(boys.fit, boys.fit2, boys.fit3, boys.fit4)
##           df       AIC
## boys.fit   6 1151.6839
## boys.fit2  7  836.3545
## boys.fit3  8  822.4164
## boys.fit4  9  823.0386

And with anova()

anova(boys.fit, boys.fit2, boys.fit3, boys.fit4)
## Analysis of Variance Table
## 
## Model 1: age ~ reg
## Model 2: age ~ reg + hgt
## Model 3: age ~ reg + hgt + wgt
## Model 4: age ~ reg + hgt * wgt
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1    218 2164.59                                    
## 2    217  521.64  1   1642.94 731.8311 < 2.2e-16 ***
## 3    216  485.66  1     35.98  16.0276 8.595e-05 ***
## 4    215  482.67  1      2.99   1.3324    0.2497    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Inspect boys.fit3

boys.fit3 %>% anova()
## Analysis of Variance Table
## 
## Response: age
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## reg         4   14.70    3.67   1.6343    0.1667    
## hgt         1 1642.94 1642.94 730.7065 < 2.2e-16 ***
## wgt         1   35.98   35.98  16.0029 8.688e-05 ***
## Residuals 216  485.66    2.25                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Inspect boys.fit4

boys.fit4 %>% anova()
## Analysis of Variance Table
## 
## Response: age
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## reg         4   14.70    3.67   1.6368    0.1661    
## hgt         1 1642.94 1642.94 731.8311 < 2.2e-16 ***
## wgt         1   35.98   35.98  16.0276 8.595e-05 ***
## hgt:wgt     1    2.99    2.99   1.3324    0.2497    
## Residuals 215  482.67    2.24                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It seems that reg and the interaction hgt:wgt are redundant

Remove reg

boys.fit5 <- 
  na.omit(boys) %$%
  lm(age ~ hgt + wgt)

Let’s revisit the comparison

anova(boys.fit, boys.fit2, boys.fit3, boys.fit5)
## Analysis of Variance Table
## 
## Model 1: age ~ reg
## Model 2: age ~ reg + hgt
## Model 3: age ~ reg + hgt + wgt
## Model 4: age ~ hgt + wgt
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1    218 2164.59                                    
## 2    217  521.64  1   1642.94 730.7065 < 2.2e-16 ***
## 3    216  485.66  1     35.98  16.0029 8.688e-05 ***
## 4    220  492.43 -4     -6.77   0.7529    0.5571    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The boys.fit5 model fits about as well as the previous model (the difference is not significant), but it needs fewer parameters.

Stepwise regression

We start with the full model, which contains all parameters for all columns.

The most straightforward way to go about this is by specifying the following model:

full.model <- lm(age ~ ., data = na.omit(boys))
full.model
## 
## Call:
## lm(formula = age ~ ., data = na.omit(boys))
## 
## Coefficients:
## (Intercept)          hgt          wgt          bmi           hc  
##    2.556051     0.059987    -0.009846     0.142162    -0.024086  
##       gen.L        gen.Q        gen.C        gen^4        phb.L  
##    1.287455    -0.006861    -0.187256     0.034186     1.552398  
##       phb.Q        phb.C        phb^4        phb^5           tv  
##    0.499620     0.656277    -0.094722    -0.113686     0.074321  
##     regeast      regwest     regsouth      regcity  
##   -0.222249    -0.233307    -0.258771     0.188423

Stepwise regression - continued

We can then start with specifying the stepwise model. In this case we choose direction both.

step.model <- step(full.model, direction = "both", 
                      trace = FALSE)
step.model
## 
## Call:
## lm(formula = age ~ hgt + bmi + phb + tv, data = na.omit(boys))
## 
## Coefficients:
## (Intercept)          hgt          bmi        phb.L        phb.Q  
##     2.07085      0.05482      0.10742      2.70328      0.28877  
##       phb.C        phb^4        phb^5           tv  
##     0.48492     -0.06225     -0.15167      0.07957

Other options are

  • forward: fit all univariate models, add the best predictor and continue.
  • backward: fit the full model, eliminate the worst predictor and continue.

Summary

step.model %>% summary
## 
## Call:
## lm(formula = age ~ hgt + bmi + phb + tv, data = na.omit(boys))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5077 -0.7144 -0.0814  0.6850  3.1724 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.070854   1.576365   1.314  0.19036    
## hgt          0.054822   0.009986   5.490 1.13e-07 ***
## bmi          0.107415   0.030135   3.564  0.00045 ***
## phb.L        2.703275   0.437108   6.184 3.12e-09 ***
## phb.Q        0.288768   0.211836   1.363  0.17426    
## phb.C        0.484921   0.202856   2.390  0.01769 *  
## phb^4       -0.062246   0.208472  -0.299  0.76555    
## phb^5       -0.151667   0.240599  -0.630  0.52912    
## tv           0.079573   0.019600   4.060 6.89e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.172 on 214 degrees of freedom
## Multiple R-squared:  0.8651, Adjusted R-squared:   0.86 
## F-statistic: 171.5 on 8 and 214 DF,  p-value: < 2.2e-16

Stepwise regression - AIC

full.model <- lm(age ~ ., data = na.omit(boys))
step.model <- MASS::stepAIC(full.model, direction = "both", 
                      trace = FALSE)
step.model
## 
## Call:
## lm(formula = age ~ hgt + bmi + phb + tv, data = na.omit(boys))
## 
## Coefficients:
## (Intercept)          hgt          bmi        phb.L        phb.Q  
##     2.07085      0.05482      0.10742      2.70328      0.28877  
##       phb.C        phb^4        phb^5           tv  
##     0.48492     -0.06225     -0.15167      0.07957

Influence of cases

dfbeta() calculates, for each case, the change in each coefficient when that case is removed from the model. The scaled version, dfbetas(), expresses these changes in standard-error units.

step.model %>%
  dfbeta() %>%
  head(n = 7)
##       (Intercept)           hgt           bmi        phb.L         phb.Q
## 3279 -0.142127090  0.0010828575 -0.0021654085 -0.021523825 -0.0006087783
## 3283 -0.005201914  0.0001127452 -0.0014100750  0.009068652 -0.0147293625
## 3296 -0.081439010  0.0004244061  0.0002603379 -0.009247562 -0.0071293767
## 3321 -0.084630826  0.0004186923  0.0008222594 -0.005498486 -0.0059276580
## 3323  0.154386486 -0.0004476521 -0.0052964367  0.042872835 -0.0213761347
## 3327 -0.046095468  0.0001081361  0.0011150546 -0.004047875 -0.0085461280
## 3357 -0.045933397  0.0001345605  0.0009770371 -0.005087053 -0.0062604484
##            phb.C        phb^4        phb^5            tv
## 3279 0.007427067 -0.004226982 -0.001213833  0.0001061106
## 3283 0.011761595 -0.005498411  0.001164307  0.0007051901
## 3296 0.008538302 -0.003688611  0.000842872  0.0002494383
## 3321 0.006839217 -0.003341741  0.000866896 -0.0002569914
## 3323 0.011758694 -0.006659755  0.000493897  0.0012317886
## 3327 0.007807966 -0.002901270  0.001499505  0.0003376220
## 3357 0.006152384 -0.002274777  0.001148552  0.0002329346

Prediction

Fitted values

Let’s use the simpler anscombe data example

fit <- anscombe %$% lm(y1 ~ x1)

y_hat <- 
  fit %>%
  fitted.values()

The residuals are then calculated as the observed values minus the fitted values:

anscombe$y1 - y_hat
##           1           2           3           4           5           6 
##  0.03900000 -0.05081818 -1.92127273  1.30909091 -0.17109091 -0.04136364 
##           7           8           9          10          11 
##  1.23936364 -0.74045455  1.83881818 -1.68072727  0.17945455

Predict new values

If we introduce new values for the predictor x1, we can generate predicted values from the model

new.x1 <- data.frame(x1 = 1:20)
fit %>% predict(newdata = new.x1)
##         1         2         3         4         5         6         7 
##  3.500182  4.000273  4.500364  5.000455  5.500545  6.000636  6.500727 
##         8         9        10        11        12        13        14 
##  7.000818  7.500909  8.001000  8.501091  9.001182  9.501273 10.001364 
##        15        16        17        18        19        20 
## 10.501455 11.001545 11.501636 12.001727 12.501818 13.001909

Predictions lie exactly on the regression line

pred <- fit %>% predict(newdata = new.x1)
lm(pred ~ new.x1$x1)$coefficients
## (Intercept)   new.x1$x1 
##   3.0000909   0.5000909
fit$coefficients
## (Intercept)          x1 
##   3.0000909   0.5000909

Prediction intervals

fit %>% predict(interval = "prediction")
##          fit      lwr       upr
## 1   8.001000 5.067072 10.934928
## 2   7.000818 4.066890  9.934747
## 3   9.501273 6.390801 12.611745
## 4   7.500909 4.579129 10.422689
## 5   8.501091 5.531014 11.471168
## 6  10.001364 6.789620 13.213107
## 7   6.000636 2.971271  9.030002
## 8   5.000455 1.788711  8.212198
## 9   9.001182 5.971816 12.030547
## 10  6.500727 3.530650  9.470804
## 11  5.500545 2.390073  8.611017

A prediction interval reflects the uncertainty around a single value. The confidence interval reflects the uncertainty around the mean prediction values.
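To see the difference, we can compare both interval types for the same fit; prediction intervals are wider because they also include the residual variance of a single new observation (a short sketch on the anscombe fit):

```r
fit <- lm(y1 ~ x1, data = anscombe)

pred_int <- predict(fit, interval = "prediction")  # uncertainty of a new observation
conf_int <- predict(fit, interval = "confidence")  # uncertainty of the mean response

# prediction intervals are wider than confidence intervals at every x
all((pred_int[, "upr"] - pred_int[, "lwr"]) >
    (conf_int[, "upr"] - conf_int[, "lwr"]))   # TRUE
```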

Assessing predictive accuracy

K-fold cross-validation

  • Divide the sample into \(k\) mutually exclusive folds
  • Do for all folds \(j\in\{1,\dots,k\}\)

    1. fit the model to all folds except fold \(j\) (the training set)
    2. obtain predictions for fold \(j\) (the test set)
    3. compute the residual variance (MSE) for test fold \(j\)
  • Compare the MSE in cross-validation with the MSE in the sample
  • A small difference suggests good predictive accuracy
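These steps can be sketched in a few lines of base R; below is a minimal illustration on the built-in anscombe data with \(k = 3\) (not the DAAG routine used later):

```r
set.seed(123)
k <- 3
n <- nrow(anscombe)
fold <- sample(rep(1:k, length.out = n))     # randomly assign each case to a fold

mse <- numeric(k)
for (j in 1:k) {
  train <- anscombe[fold != j, ]             # all folds except fold j
  test  <- anscombe[fold == j, ]             # the held-out fold
  m     <- lm(y1 ~ x1, data = train)         # 1. fit on the training folds
  pred  <- predict(m, newdata = test)        # 2. predict the held-out fold
  mse[j] <- mean((test$y1 - pred)^2)         # 3. MSE on the held-out fold
}
mean(mse)   # cross-validated MSE, to compare with the in-sample MSE
```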

The original model

fit %>% summary()
## 
## Call:
## lm(formula = y1 ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

K-fold cross-validation anscombe data

DAAG::CVlm(anscombe, fit, plotit=F, printit=T)
## Analysis of Variance Table
## 
## Response: y1
##           Df Sum Sq Mean Sq F value Pr(>F)   
## x1         1   27.5   27.51      18 0.0022 **
## Residuals  9   13.8    1.53                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## fold 1 
## Observations in test set: 3 
##                  5    7    11
## x1          11.000 6.00 5.000
## cvpred       8.443 5.59 5.014
## y1           8.330 7.24 5.680
## CV residual -0.113 1.65 0.666
## 
## Sum of squares = 3.19    Mean square = 1.06    n = 3 
## 
## fold 2 
## Observations in test set: 4 
##                  2      6     8    10
## x1           8.000 14.000  4.00  7.00
## cvpred       7.572  9.681  6.17  7.22
## y1           6.950  9.960  4.26  4.82
## CV residual -0.622  0.279 -1.91 -2.40
## 
## Sum of squares = 9.86    Mean square = 2.47    n = 4 
## 
## fold 3 
## Observations in test set: 4 
##                 1     3    4     9
## x1          10.00 13.00 9.00 12.00
## cvpred       7.84  9.37 7.33  8.86
## y1           8.04  7.58 8.81 10.84
## CV residual  0.20 -1.79 1.48  1.98
## 
## Sum of squares = 9.35    Mean square = 2.34    n = 4 
## 
## Overall (Sum over all 4 folds) 
##   ms 
## 2.04

K-fold cross-validation anscombe data

  • residual variance sample is \(1.24^2 \approx 1.53\)
  • residual variance cross-validation is 2.04
  • regression lines in the 3 folds are similar
DAAG::CVlm(anscombe, fit, plotit="Observed", printit=F)

Plotting the residuals

DAAG::CVlm(anscombe, fit, plotit="Residual", printit=F)

K-fold cross-validation boys data

  • residual variance sample is \(1.172^2 \approx 1.37\)
  • residual variance cross-validation is 1.46
  • regression lines in the 3 folds almost identical
DAAG::CVlm(na.omit(boys), step.model, plotit="Observed", printit=F)

Plotting the residuals

DAAG::CVlm(na.omit(boys), step.model, plotit="Residual", printit=F)

How many cases are used?

na.omit(boys) %$%
  lm(age ~ reg + hgt * wgt) %>%
  nobs()
## [1] 223

If we would not have used na.omit()

boys %$%
  lm(age ~ reg + hgt * wgt) %>%
  nobs()
## [1] 724
Logistic models

Visualizing marginal effects

Example: titanic data

We start this lecture with a data set that logs the survival of passengers aboard the ocean liner Titanic on its disastrous maiden voyage.

titanic <- read.csv(file = "titanic.csv", header = TRUE)
titanic %>% describe
##                         vars   n   mean     sd median trimmed   mad  min
## Survived                   1 887   0.39   0.49    0.0    0.36   0.0 0.00
## Pclass                     2 887   2.31   0.84    3.0    2.38   0.0 1.00
## Name*                      3 887 444.00 256.20  444.0  444.00 329.1 1.00
## Sex*                       4 887   1.65   0.48    2.0    1.68   0.0 1.00
## Age                        5 887  29.47  14.12   28.0   28.96  11.9 0.42
## Siblings.Spouses.Aboard    6 887   0.53   1.10    0.0    0.27   0.0 0.00
## Parents.Children.Aboard    7 887   0.38   0.81    0.0    0.19   0.0 0.00
## Fare                       8 887  32.31  49.78   14.4   21.50  10.3 0.00
##                         max range  skew kurtosis   se
## Survived                  1   1.0  0.47    -1.78 0.02
## Pclass                    3   2.0 -0.62    -1.29 0.03
## Name*                   887 886.0  0.00    -1.20 8.60
## Sex*                      2   1.0 -0.61    -1.63 0.02
## Age                      80  79.6  0.45     0.28 0.47
## Siblings.Spouses.Aboard   8   8.0  3.67    17.64 0.04
## Parents.Children.Aboard   6   6.0  2.73     9.63 0.03
## Fare                    512 512.3  4.76    32.99 1.67

Inspect the data set

str(titanic)
## 'data.frame':    887 obs. of  8 variables:
##  $ Survived               : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass                 : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name                   : Factor w/ 887 levels "Capt. Edward Gifford Crosby",..: 602 823 172 814 733 464 700 33 842 839 ...
##  $ Sex                    : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age                    : num  22 38 26 35 35 27 54 2 27 14 ...
##  $ Siblings.Spouses.Aboard: int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parents.Children.Aboard: int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Fare                   : num  7.25 71.28 7.92 53.1 8.05 ...

What sources of information

We have information on the following features.

Our outcome/dependent variable:

  • Survived: yes or no

Some potential predictors:

  • Sex: the passenger’s gender coded as c(male, female)
  • Pclass: the class the passenger traveled in
  • Age: the passenger’s age in years
  • Siblings.Spouses.Aboard: if siblings or spouses were also aboard
  • Parents.Children.Aboard: if the passenger’s parents or children were aboard

and more.

Hypothetically

We can start investigating if there are patterns in this data that are related to the survival probability.

For example, we could hypothesize based on the credo “women and children first” that

  • Age relates to the probability of survival in that younger passengers have a higher probability of survival
  • Sex relates to survival in that females have a higher probability of survival

Based on socio-economic status, we could hypothesize that

  • Pclass relates to the probability of survival in that higher travel class leads to a higher probability of survival

And so on.

A quick investigation

Is Age related?

Inspecting the data

titanic %$% table(Pclass, Survived)
##       Survived
## Pclass   0   1
##      1  80 136
##      2  97  87
##      3 368 119

It seems that the higher the class (i.e. 1 > 2 > 3), the higher the probability of survival.

We can verify this

titanic %$% table(Pclass, Survived) %>% prop.table(margin = 1) %>% round(digits = 2)
##       Survived
## Pclass    0    1
##      1 0.37 0.63
##      2 0.53 0.47
##      3 0.76 0.24

A more thorough inspection

Fitting the titanic models

Let’s fit these models and gradually build them up in the number of predictors.

fit1 <- titanic %$% 
  glm(Survived ~ Age, family = binomial(link="logit"))

fit2 <- titanic %$% 
  glm(Survived ~ Sex, family = binomial(link="logit"))

fit3 <- titanic %$% 
  glm(Survived ~ Pclass, family = binomial(link="logit"))

Survived ~ Age

titanic %$% histogram(~ Age | Survived == 1)

The distribution of Age for the survivors (TRUE) is different from the distribution of Age for the non-survivors (FALSE). Especially at the younger end there is a point mass for the survivors, which indicates that children have a higher probability of survival. However, it is not dramatically different.

Survived ~ Age

fit1 %>% summary %>% .$coefficients
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.20919    0.15949   -1.31   0.1897
## Age         -0.00877    0.00495   -1.77   0.0761

We can see that there is a trend. The log odds of Survived decrease with increasing Age: for every year increase in age, the log odds of Survived decrease by 0.009. However, this decrease is not very convincing given the effect size, the standard error and the size of the data. Hence, the p-value is a little on the larger side.

When we inspect the deviance

c(fit1$null.deviance, fit1$deviance)
## [1] 1183 1180

We see that there is almost no decrease in deviance between the empty model (i.e. the model with only an intercept) and the model with Age included. The difference in df is only 1 parameter.

Survived ~ Sex

titanic %$% histogram(~ Sex | Survived == 1)

Wow! These distributions are very different! Females seem to have a much higher probability of survival.

Survived ~ Sex

fit2 %>% summary %>% .$coefficients
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)     1.06      0.129    8.19 2.58e-16
## Sexmale        -2.51      0.167  -14.98 9.95e-51

The coefficients confirm this. The log odds of Survived decrease for males, when compared to females, and the decrease is quite pronounced: 2.505.

When we inspect the deviance

c(fit2$null.deviance, fit2$deviance)
## [1] 1183  916

We see that there is a massive drop in deviance between the empty model (i.e. the model with only an intercept) and the model with Sex included, at the cost of only 1 additional parameter (df).

Survived ~ Pclass

titanic %$% histogram(~ Pclass | Survived == 1)

There is a very apparent difference between the distributions of the survivors and non-survivors over the classes. For example, in 1st class there are more survivors than non-survivors, while in 2nd and especially 3rd class this relation is reversed.

Survived ~ Pclass

fit3 %>% summary %>% .$coefficients
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)    1.439     0.2074    6.94 3.98e-12
## Pclass        -0.844     0.0872   -9.68 3.57e-22

Here, there is something wrong! Class is an ordered categorical variable, not a continuous variable. In other words, we cannot simply decrease the log odds of Survived by 0.844 for every unit increase in Pclass.

Why not?

If we would interpret these coefficients ‘as is’, we would assume that class two is twice the value for class 1, that class three is 1.5 times the value for class 2, and so on.
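A quick way to see the difference is to compare the design matrices R builds for a numeric versus a factor-coded predictor (a small sketch with model.matrix() on made-up class values):

```r
# numeric coding: a single column, assuming equal steps between classes
model.matrix(~ Pclass, data = data.frame(Pclass = c(1, 2, 3)))

# factor coding: one dummy variable per non-reference class,
# so each class gets its own effect
mm <- model.matrix(~ Pclass, data = data.frame(Pclass = factor(c(1, 2, 3))))
mm
```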

Edit the data

titanic %<>% 
  mutate(Pclass = factor(Pclass, levels = c(3, 2, 1), ordered = FALSE))
str(titanic)
## 'data.frame':    887 obs. of  8 variables:
##  $ Survived               : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass                 : Factor w/ 3 levels "3","2","1": 1 3 1 3 1 1 3 1 1 2 ...
##  $ Name                   : Factor w/ 887 levels "Capt. Edward Gifford Crosby",..: 602 823 172 814 733 464 700 33 842 839 ...
##  $ Sex                    : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age                    : num  22 38 26 35 35 27 54 2 27 14 ...
##  $ Siblings.Spouses.Aboard: int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parents.Children.Aboard: int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Fare                   : num  7.25 71.28 7.92 53.1 8.05 ...

The Pclass column is now correctly coded as a factor. We ignore the ordering as this goes beyond the scope of the course.

Titanic with main effects

titanic %$% 
  glm(Survived ~ Age + Sex + Pclass, family = binomial(link = "logit")) %>% 
  summary()

## 
## Call:
## glm(formula = Survived ~ Age + Sex + Pclass, family = binomial(link = "logit"))
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.681  -0.665  -0.414   0.637   2.450  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.17948    0.22784    5.18  2.3e-07 ***
## Age         -0.03427    0.00716   -4.79  1.7e-06 ***
## Sexmale     -2.58872    0.18701  -13.84  < 2e-16 ***
## Pclass2      1.25633    0.22646    5.55  2.9e-08 ***
## Pclass1      2.45544    0.25322    9.70  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1182.77  on 886  degrees of freedom
## Residual deviance:  801.59  on 882  degrees of freedom
## AIC: 811.6
## 
## Number of Fisher Scoring iterations: 5

Titanic with interactions

fit.interaction <- titanic %$% glm(Survived ~ Age * Sex * Pclass, 
                                   family = binomial(link="logit"))
fit.interaction %>% summary %>% .$coefficients
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)           0.3854     0.3516   1.096   0.2730
## Age                  -0.0174     0.0140  -1.245   0.2132
## Sexmale              -1.2410     0.5197  -2.388   0.0169
## Pclass2               3.6638     1.3814   2.652   0.0080
## Pclass1               1.1122     1.4959   0.744   0.4572
## Age:Sexmale          -0.0226     0.0207  -1.094   0.2740
## Age:Pclass2          -0.0320     0.0385  -0.831   0.4059
## Age:Pclass1           0.0804     0.0528   1.521   0.1282
## Sexmale:Pclass2      -1.9112     1.5759  -1.213   0.2252
## Sexmale:Pclass1       0.8149     1.6602   0.491   0.6236
## Age:Sexmale:Pclass2  -0.0313     0.0494  -0.633   0.5265
## Age:Sexmale:Pclass1  -0.0800     0.0569  -1.407   0.1595

Now none of the main effects are significant: the explained variation has shifted into the interaction effects.

Interactions

An interaction occurs when the (causal) effect of one predictor on the outcome depends on the level of another predictor.


E.g. the relation between body temperature and air temperature depends on the species.
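A toy simulation of this example (the species, temperatures, and numbers below are fabricated for illustration): the slope of `air` differs by species, which shows up as an interaction term.

```r
set.seed(1)
species <- rep(c("ectotherm", "endotherm"), each = 50)
air     <- runif(100, min = 0, max = 40)   # air temperature in degrees C
# Ectotherm body temperature tracks the air; endotherms regulate to ~37
body    <- ifelse(species == "ectotherm", air, 37) + rnorm(100, sd = 0.5)
round(coef(lm(body ~ air * species)), 2)
# The air:speciesendotherm term cancels the slope of air for endotherms
```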

Visualizing the effects

To illustrate, I will limit this investigation to Age and Pclass for males only.

  • We can use the predict function to illustrate the conditional probabilities within each class

To do so, we need to create a new data frame that has all the combinations of predictors we need.

new <- data.frame(Pclass = factor(rep(c(1, 2, 3), c(80, 80, 80))), 
                  Age = rep(1:80, times = 3),
                  Sex = rep("male", times = 240))
new <- cbind(new, 
             predict(fit.interaction, newdata = new, 
                     type = "link", se = TRUE))
head(new)
##   Pclass Age  Sex   fit se.fit residual.scale
## 1      1   1 male 1.032  0.596              1
## 2      1   2 male 0.992  0.583              1
## 3      1   3 male 0.952  0.569              1
## 4      1   4 male 0.913  0.555              1
## 5      1   5 male 0.873  0.542              1
## 6      1   6 male 0.833  0.528              1

Adding the predicted probabilities

There are two simple approaches to obtain the predicted probabilities. First, we could simply ask for the predicted response:

new$prob <- predict(fit.interaction, newdata = new, type = "response")

Or we could use the distribution function plogis():

new$prob2 <- plogis(new$fit)
head(new)
##   Pclass Age  Sex   fit se.fit residual.scale  prob prob2
## 1      1   1 male 1.032  0.596              1 0.737 0.737
## 2      1   2 male 0.992  0.583              1 0.729 0.729
## 3      1   3 male 0.952  0.569              1 0.722 0.722
## 4      1   4 male 0.913  0.555              1 0.714 0.714
## 5      1   5 male 0.873  0.542              1 0.705 0.705
## 6      1   6 male 0.833  0.528              1 0.697 0.697
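Both routes agree because `type = "response"` simply applies the inverse-logit to the linear predictor. A standalone check (the value 1.032 is the first `fit` above):

```r
eta    <- c(-2, 0, 1.032)            # values on the link (log-odds) scale
manual <- exp(eta) / (1 + exp(eta))  # inverse logit, written out by hand
all.equal(manual, plogis(eta))       # TRUE; plogis(1.032) is the 0.737 above
```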

Adding confidence intervals

new %<>% 
  mutate(lower = plogis(fit - 1.96 * se.fit),
         upper = plogis(fit + 1.96 * se.fit))

head(new)
##   Pclass Age  Sex   fit se.fit residual.scale  prob prob2 lower upper
## 1      1   1 male 1.032  0.596              1 0.737 0.737 0.466 0.900
## 2      1   2 male 0.992  0.583              1 0.729 0.729 0.463 0.894
## 3      1   3 male 0.952  0.569              1 0.722 0.722 0.459 0.888
## 4      1   4 male 0.913  0.555              1 0.714 0.714 0.456 0.881
## 5      1   5 male 0.873  0.542              1 0.705 0.705 0.453 0.874
## 6      1   6 male 0.833  0.528              1 0.697 0.697 0.450 0.866

What do we have?

A data frame with simulated Pclass and Age for males.

new %>% summary()
##  Pclass      Age         Sex           fit            se.fit     
##  1:80   Min.   : 1.0   male:240   Min.   :-7.37   Min.   :0.159  
##  2:80   1st Qu.:20.8              1st Qu.:-3.27   1st Qu.:0.288  
##  3:80   Median :40.5              Median :-1.79   Median :0.416  
##         Mean   :40.5              Mean   :-2.10   Mean   :0.513  
##         3rd Qu.:60.2              3rd Qu.:-0.74   3rd Qu.:0.609  
##         Max.   :80.0              Max.   : 1.03   Max.   :1.607  
##  residual.scale      prob           prob2           lower      
##  Min.   :1      Min.   :0.001   Min.   :0.001   Min.   :0.000  
##  1st Qu.:1      1st Qu.:0.037   1st Qu.:0.037   1st Qu.:0.012  
##  Median :1      Median :0.143   Median :0.143   Median :0.083  
##  Mean   :1      Mean   :0.213   Mean   :0.213   Mean   :0.136  
##  3rd Qu.:1      3rd Qu.:0.322   3rd Qu.:0.322   3rd Qu.:0.222  
##  Max.   :1      Max.   :0.737   Max.   :0.737   Max.   :0.466  
##      upper      
##  Min.   :0.015  
##  1st Qu.:0.108  
##  Median :0.250  
##  Mean   :0.312  
##  3rd Qu.:0.441  
##  Max.   :0.900

Visualizing the effects: link

Visualizing the effects: probabilities

new %>%
  ggplot(aes(x = Age, y = prob)) + 
  geom_ribbon(aes(ymin = lower, ymax = upper, fill = Pclass), alpha = .2) +
  geom_line(aes(colour = Pclass), lwd = 1) + ylab("Probability of Survival")

No interaction assumed

To repeat the previous scenario for the model where only main effects are included:

fit <- titanic %$% glm(Survived ~ Age + Sex + Pclass, family = binomial(link="logit")) 
new2 <- new %>% select(Age, Pclass, Sex)
new2 <- cbind(new2, predict(fit, newdata = new2, type = "link", se = TRUE))
new2 %<>% 
  mutate(prob = plogis(fit),
         lower = plogis(fit - 1.96 * se.fit),
         upper = plogis(fit + 1.96 * se.fit))

No interaction: the link

No interaction: probabilities

new2 %>%
  ggplot(aes(x = Age, y = prob)) + 
  geom_ribbon(aes(ymin = lower, ymax = upper, fill = Pclass), alpha = .2) +
  geom_line(aes(colour = Pclass), lwd = 1) + ylab("Probability of Survival")

Model comparison

anova(fit, fit.interaction, test = "Chisq") 
## Analysis of Deviance Table
## 
## Model 1: Survived ~ Age + Sex + Pclass
## Model 2: Survived ~ Age * Sex * Pclass
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
## 1       882        802                         
## 2       875        755  7     46.4  7.3e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction model fits the data much better.

AIC(fit, fit.interaction)
##                 df AIC
## fit              5 812
## fit.interaction 12 779

Prediction

pred <- predict(fit, type = "response")
head(pred)
##      1      2      3      4      5      6 
## 0.1031 0.9115 0.5716 0.9195 0.0686 0.0883
pred <- ifelse(pred > 0.5, yes = 1, no = 0)
head(pred)
## 1 2 3 4 5 6 
## 0 1 1 1 0 0

Confusion matrix

CM <- table(pred, titanic$Survived)
CM
##     
## pred   0   1
##    0 463 100
##    1  82 242

The accuracy can then be calculated as the percentage of correct predictions, i.e. the sum over the diagonal elements divided by the total number of cases:

correct <- CM %>% diag %>% sum
correct / sum(CM)
## [1] 0.795

In a function

confusionMatrix(as.factor(pred), reference = as.factor(titanic$Survived))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 463 100
##          1  82 242
##                                         
##                Accuracy : 0.795         
##                  95% CI : (0.767, 0.821)
##     No Information Rate : 0.614         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.563         
##                                         
##  Mcnemar's Test P-Value : 0.208         
##                                         
##             Sensitivity : 0.850         
##             Specificity : 0.708         
##          Pos Pred Value : 0.822         
##          Neg Pred Value : 0.747         
##              Prevalence : 0.614         
##          Detection Rate : 0.522         
##    Detection Prevalence : 0.635         
##       Balanced Accuracy : 0.779         
##                                         
##        'Positive' Class : 0             
## 

Sensitivity / Specificity

##                        not.Survived Survived
## Predicted not.Survived            A        B
## Predicted Survived                C        D
  • Note that caret takes Survived = 0 (not survived) as the ‘Positive’ class here, so not.Survived plays the role of the positive class below.
  • Sensitivity is calculated as \(\text{Sensitivity} = A/(A+C) = \frac{463}{463+82} =\) 0.85
    • Sensitivity is the proportion of actual not.Survived cases that have been correctly predicted as not.Survived.
    • Sensitivity is the True Positive Rate
  • Specificity is calculated as \(\text{Specificity} = D/(B+D) = \frac{242}{100+242} =\) 0.708
    • Specificity is the proportion of actual Survived cases that have been correctly predicted as Survived.
    • Specificity is the True Negative Rate
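The two rates can also be computed by hand from the confusion matrix. A minimal sketch, hard-coding the counts from `CM` above (rows: predicted, columns: observed, with 0 = not survived as the positive class):

```r
CM <- matrix(c(463, 82, 100, 242), nrow = 2,
             dimnames = list(pred = c("0", "1"), obs = c("0", "1")))
sensitivity <- CM["0", "0"] / sum(CM[, "0"])   # A / (A + C) = 463 / 545
specificity <- CM["1", "1"] / sum(CM[, "1"])   # D / (B + D) = 242 / 342
round(c(sensitivity = sensitivity, specificity = specificity), 3)
```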

Cross validating predictive power

fit <- glm(Survived ~ Age + Sex + Pclass, family = binomial(link="logit"), 
           data = titanic)
set.seed(123)
cv <- CVbinary(fit)
## 
## Fold:  3 10 2 6 5 4 9 8 7 1
## Internal estimate of accuracy = 0.795
## Cross-validation estimate of accuracy = 0.793

The accuracy after cross-validation is about the same as the accuracy in the data as a whole. This indicates that the fitted model may be useful for new data from the same population; i.e. the model may generalize to new data.

The `CVbinary()` function relies on K-fold cross-validation, which randomly splits the data into K folds (parts). To allow you to replicate these results I fix the random seed.

  • In real life there is no need for fixing the random seed if reproducibility is not required.
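K-fold cross-validation itself is straightforward to sketch by hand. An illustrative manual 10-fold version on simulated data (this is not `CVbinary()`'s internal code):

```r
set.seed(123)
n <- 200
d <- data.frame(x = rnorm(n))
d$y  <- rbinom(n, 1, plogis(d$x))          # outcome generated from a logistic model
fold <- sample(rep(1:10, length.out = n))  # random assignment to 10 folds
acc <- sapply(1:10, function(k) {
  fit <- glm(y ~ x, family = binomial, data = d[fold != k, ])   # train on 9 folds
  p   <- predict(fit, newdata = d[fold == k, ], type = "response")
  mean((p > 0.5) == d$y[fold == k])        # accuracy on the held-out fold
})
mean(acc)                                  # cross-validated accuracy estimate
```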

Cross validating predictive power

set.seed(123)
CVbinary(fit, nfolds = 5)
## 
## Fold:  3 2 5 4 1
## Internal estimate of accuracy = 0.795
## Cross-validation estimate of accuracy = 0.797
CVbinary(fit, nfolds = 2)
## 
## Fold:  1 2
## Internal estimate of accuracy = 0.795
## Cross-validation estimate of accuracy = 0.782
Principal components analysis

Clustering

Custom theme for plots

helpIAmColourblind <- scale_color_manual(values = c("orange", 
                                                    "blue", 
                                                    "dark green"))

What is statistical learning?

Statistics is everywhere

Several questions involving statistics:

  1. What is the relation between \(X\) and \(Y\)? (estimation)
  2. What is the uncertainty around this effect? (estimation/inference)
  3. What can I conclude about my population? (inference/hypothesis testing)

  4. How can I best predict new observations? (prediction)
  5. How can I show relevant patterns in my data? (dimension reduction / pattern recognition)

Examples

  • Radiologists use statistics to sharpen their images (e.g. MRI) and improve their diagnosis.
  • Doctors use statistics to target your treatment to your symptoms or body.
  • Physicists use statistics to find useful patterns in the huge data dumps by the Large Hadron Collider.
  • Insurance companies use statistics to model risks for potential clients.
  • Google uses statistics to serve you targeted ads.
  • Netflix uses statistics to create hit shows.
  • Spotify uses statistics to suggest music to you.

Supervised learning

In supervised learning we aim to quantify the relation between \(Y\) and \(X\).

\(Y\):

  • target
  • outcome
  • dependent
  • response

\(X\):

  • features
  • predictors
  • independent
  • input

Supervised learning

We want to find the predictive function:

\[Y = f(X) + \epsilon \]

That minimizes \(\epsilon\) with respect to our goal.

  • Function \(f\) is an unknown, but fixed function of \(X = X_1, \dots, X_p\)
  • \(p\) is the number of predictors
  • \(Y\) is the quantitative response
  • \(\epsilon \sim N(0, \sigma_\epsilon^2)\) is a random error term
  • \(\epsilon\) does not depend on \(X\)

Our aim is to find the \(f(X)\) that best represents the systematic information that \(X\) yields about \(Y\).

Supervised learning

With supervised learning every observation on our predictor

\[x_i, i=1, \dots, n\]

has a corresponding outcome measurement

\[y_i\] such that

\[\hat{y_i}=f({\bf x_i})\quad \text{and} \quad y_i = f({\bf x_i})+\epsilon_i.\]

Examples:

  • linear regression
  • logistic regression
  • k-nearest neighbours classifying

Unsupervised learning

With unsupervised learning we have a vector of measurements \(\bf x_i\) for every unit \(i=1, \dots, n\), but we miss the associated response \(y_i\).

  1. There is no outcome to predict

    • Hence you cannot fit e.g. a linear regression model
  2. There is no outcome to verify the model

    • We lack the truth to supervise our analysis

What can we do?

Find patterns in \(\bf x_1, \dots, x_n\)

We can use such a model to find out, e.g., whether some cases are more similar than others, or which variables explain most of the variation.

Examples:

  • Principal Components Analysis
  • k-means clustering

K-means and K-nearest neighbours

Two nonparametric algorithms

K-nearest neighbours (KNN)

  • supervised learning
  • prediction
  • classification

K-means clustering (kmeans)

  • unsupervised learning
  • dimension reduction / pattern recognition
  • clustering

Example dataset

Let’s create some data from a multivariate normal distribution

We start by fixing the random seed

set.seed(123)

and specifying the variance covariance matrix:

sigma <- matrix(c(1, .5, .5, 1), 2, 2)
rownames(sigma) <- colnames(sigma) <- c("x1", "x2")

Data relations

sigma
##     x1  x2
## x1 1.0 0.5
## x2 0.5 1.0

Because the variances are 1, the resulting data will have a correlation of \[\rho = \frac{\text{cov}(x_1, x_2)}{\sigma_{x_1}\sigma_{x_2}} = \frac{.5}{1\times1} = .5.\]
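A quick numerical check of this claim, built from independent normals so it runs without `mvtnorm` (this construction gives unit variances and covariance .5 by design):

```r
set.seed(1)
n  <- 1e5
x1 <- rnorm(n)
x2 <- 0.5 * x1 + sqrt(1 - 0.5^2) * rnorm(n)  # Var(x2) = .25 + .75 = 1, Cov = .5
round(cor(x1, x2), 2)                        # close to .5
```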

Let’s draw the data

sim.data <- mvtnorm::rmvnorm(n     = 100, 
                             mean  = c(5, 5), 
                             sigma = sigma)
colnames(sim.data) <- c("x1", "x2")

Plot the data

sim.data %>% 
  as_tibble %>%
  ggplot(aes(x1, x2)) +
  geom_point()

Now add some clustering

sim.data <- 
  sim.data %>%
  as_tibble %>%
  mutate(class = sample(c("A", "B", "C"), size = 100, replace = TRUE))

We have added a new column that randomly assigns rows to level A, B or C

sim.data %>% head
## # A tibble: 6 x 3
##      x1    x2 class
##   <dbl> <dbl> <chr>
## 1  4.40  4.63 C    
## 2  6.52  5.47 C    
## 3  5.57  6.69 C    
## 4  5.12  3.90 A    
## 5  4.22  4.39 A    
## 6  6.28  5.66 C

Plot the data again

sim.data %>%
  ggplot(aes(x1, x2,  colour = class)) +
  geom_point() + 
  helpIAmColourblind

Adjust the clusters to make them distinct

sim.data <- 
  sim.data %>%
  mutate(x2 = case_when(class == "A" ~ x2 + 1.5,
                        class == "B" ~ x2 - 1.5,
                        class == "C" ~ x2 + 1.5),
         x1 = case_when(class == "A" ~ x1 - 1.5,
                        class == "B" ~ x1 - 0,
                        class == "C" ~ x1 + 1.5))

The result: supervised

sim.data %>%
  ggplot(aes(x1, x2,  colour = class)) +
  geom_point() + 
  helpIAmColourblind

The result: unsupervised

sim.data %>%
  ggplot(aes(x1, x2)) +
  geom_point()

K-nearest neighbours

How does it work?

  1. For every test observation \(x_0\), the \(K\) points closest to \(x_0\) are identified.
  2. These closest points form set \(\mathcal{N}_0\).
  3. We estimate the probability for \(x_0\) being part of class \(j\) as the fraction of points in \(\mathcal{N}_0\) for whom the response equals \(j\): \[P(Y = j | X = x_0) = \frac{1}{K}\sum_{i\in\mathcal{N}_0}I(y_i=j)\]

  4. The observation \(x_0\) is classified to the class with the largest probability
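The voting rule above can be written out for a single test point. A minimal sketch using Euclidean distance (`knn_one` is a hypothetical helper written for this illustration, not a function from the class package):

```r
knn_one <- function(train, cl, x0, k = 3) {
  d  <- sqrt(rowSums(t(t(train) - x0)^2))  # distance from x0 to every point
  nn <- order(d)[seq_len(k)]               # indices of the K nearest neighbours
  names(which.max(table(cl[nn])))          # majority vote among their classes
}
train <- matrix(c(0, 0,  0, 1,  5, 5,  5, 6), ncol = 2, byrow = TRUE)
cl    <- c("A", "A", "B", "B")
knn_one(train, cl, x0 = c(0.2, 0.1), k = 3)  # two of three neighbours are "A"
```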

In short

An observation is assigned the class to which most of its \(K\) nearest neighbours belong

Why KNN?

Because \(x_0\) is assigned to the class to which most of its neighbouring observations belong, KNN is

  • non-parametric

    • no assumptions about the distributions, or the shape of the decision boundary
  • expected to be far better than logistic regression when decision boundaries are non-linear

However, we do not get parameters as with LDA and regression.

  • We thus cannot determine the relative importance of predictors
  • The “model” == the existing observations: instance-based learning

Fitting a K-NN model

First we need to determine a training set

set.seed(123)
sim.data <-
  sim.data %>% 
  mutate(set = sample(c("Train", "Test"), size = 100, 
                      prob = c(.25, .75), replace = TRUE))
sim.data
## # A tibble: 100 x 4
##       x1    x2 class set  
##    <dbl> <dbl> <chr> <chr>
##  1  5.90  6.13 C     Test 
##  2  8.02  6.97 C     Train
##  3  7.07  8.19 C     Test 
##  4  3.62  5.40 A     Train
##  5  2.72  5.89 A     Train
##  6  7.78  7.16 C     Test 
##  7  5.42  3.71 B     Test 
##  8  6.43  8.08 C     Train
##  9  4.97  1.73 B     Test 
## 10  4.06  6.22 A     Test 
## # … with 90 more rows

Fitting a K-NN model

Then we split the data into a training (build the model) and a test (verify the model) set

train.data <- subset(sim.data, set == "Train", select = c(x1, x2))
test.data <-  subset(sim.data, set == "Test",  select = c(x1, x2))
obs.class <-  subset(sim.data, set == "Train", select = class)

Now we can fit the K-NN model

fit.knn <- knn(train = train.data,
               test  = test.data, 
               cl    = as.matrix(obs.class),
               k     = 3)
fit.knn
##  [1] C C C B B A B B A A C C A A C A A C C C B C C B B A B B B B A B A B A
## [36] C A C C B C C C A A C B C B A A B B C C A B B C C C B A B B C B A C A
## [71] C B A
## Levels: A B C

Predictions

class.test <- subset(sim.data, set == "Test", select = class) %>%
  as.matrix()
correct <- fit.knn == class.test
mean(correct)
## [1] 0.959
table(fit.knn, class.test)
##        class.test
## fit.knn  A  B  C
##       A 21  0  0
##       B  0 24  1
##       C  0  2 25

The (in)correct responses for K = 3

cbind(test.data, correct) %>%
  ggplot(aes(x1, x2,  colour = correct)) +
  geom_point() +
  scale_colour_manual(values = c("red", "black"))

Fewer neighbours

fit.knn <- knn(train = train.data,
               test  = test.data, 
               cl    = as.matrix(obs.class),
               k     = 2)
correct <- fit.knn == class.test
mean(correct)
## [1] 0.945
table(fit.knn, class.test)
##        class.test
## fit.knn  A  B  C
##       A 21  1  1
##       B  0 23  0
##       C  0  2 25

More neighbours

fit.knn <- knn(train = train.data,
               test  = test.data, 
               cl    = as.matrix(obs.class),
               k     = 4)
correct <- fit.knn == class.test
mean(correct)
## [1] 0.945
table(fit.knn, class.test)
##        class.test
## fit.knn  A  B  C
##       A 21  0  1
##       B  0 24  1
##       C  0  2 24

Even more neighbours

fit.knn <- knn(train = train.data,
               test = test.data, 
               cl = as.matrix(obs.class),
               k = 10)
correct <- fit.knn == class.test
mean(correct)
## [1] 0.89
table(fit.knn, class.test)
##        class.test
## fit.knn  A  B  C
##       A 21  1  5
##       B  0 25  2
##       C  0  0 19

The (in)correct responses for K = 10

cbind(test.data, correct) %>%
  ggplot(aes(x1, x2,  colour = correct)) +
  geom_point() +
  scale_colour_manual(values = c("red", "black"))

Predicting a new observation

Let’s make a new observation:

newObs <- data.frame(x1 = 5.5, x2 = 4.5)

Predicting a new observation

Predicting a new observation

Now we predict the class of this new observation, using the entire data for training our model

knn(train = sim.data[, 1:2], cl = sim.data$class, k = 10, test = newObs)
## [1] B
## Levels: A B C

K-means clustering

Remember: unsupervised

sim.data %>%
  ggplot(aes(x1, x2)) +
  geom_point()

Goal: finding clusters in our data

K-means clustering partitions our dataset into \(K\) distinct, non-overlapping clusters or subgroups.

What is a cluster?

A set of relatively similar observations.

What is “relatively similar”?

This is up to the programmer/researcher to decide. For example, we can require that the within-cluster variance is as small as possible and the between-cluster variance as large as possible.

Why perform clustering?

We expect clusters in our data, but weren’t able to measure them

  • potential new subtypes of cancer tissue

We want to summarise features into a categorical variable to use in further decisions/analysis

  • subgrouping people by their spending types

The k-means algorithm

  1. Randomly assign each observation to one of the K classes
  2. Calculate the centroid (colMeans) of each class
  3. Re-assign each observation to the class of its closest centroid
  4. If any assignments changed, go to step 2; otherwise stop.

Source: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112). New York: Springer.
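The four steps can be sketched as a bare-bones loop. This is an illustrative naive version (`simple_kmeans` is made up for this sketch; `stats::kmeans()` uses the smarter Hartigan-Wong algorithm by default) and it assumes no cluster ever ends up empty:

```r
simple_kmeans <- function(X, K, max_iter = 100) {
  cl <- sample(K, nrow(X), replace = TRUE)       # 1. random initial classes
  for (i in seq_len(max_iter)) {
    centers <- apply(X, 2, tapply, cl, mean)     # 2. centroid of each class
    d <- as.matrix(dist(rbind(centers, X)))[-(1:K), 1:K, drop = FALSE]
    new_cl <- max.col(-d)                        # 3. closest centroid per row
    if (all(new_cl == cl)) break                 # 4. stop when nothing moves
    cl <- new_cl
  }
  list(cluster = cl, centers = centers)
}
set.seed(123)
X <- rbind(matrix(rnorm(40), ncol = 2), matrix(rnorm(40, mean = 4), ncol = 2))
table(simple_kmeans(X, K = 2)$cluster)           # cluster sizes
```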

The k-means algorithm

K is a tuning parameter (centers)

(fitkm <- kmeans(sim.data[, 1:2], centers = 3))
## K-means clustering with 3 clusters of sizes 39, 35, 26
## 
## Cluster means:
##     x1   x2
## 1 4.99 3.53
## 2 3.59 6.39
## 3 6.85 6.98
## 
## Clustering vector:
##   [1] 3 3 3 2 2 3 1 3 1 2 1 1 1 2 2 2 3 3 2 2 3 2 3 1 2 2 3 1 2 1 2 1 3 1 3
##  [36] 1 2 1 2 1 1 1 1 2 1 2 1 2 3 1 2 3 1 1 1 3 2 2 1 1 2 2 3 1 3 1 1 1 2 1
##  [71] 2 2 2 1 1 3 3 2 1 1 3 3 3 3 1 2 2 1 2 1 1 3 1 2 3 2 2 3 1 2
## 
## Within cluster sum of squares by cluster:
## [1] 49.3 46.5 49.2
##  (between_SS / total_SS =  73.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

The result:

sim.data$clust <- as.factor(fitkm$cluster)

sim.data %>% ggplot +
  geom_point(aes(x = x1, y = x2, colour = clust)) +
  helpIAmColourblind

Comparison

# this is the data-generating class

sim.data %>% ggplot +
  geom_point(aes(x = x1, y = x2, colour = class)) +
  helpIAmColourblind

Centroids

sim.data %>% ggplot +
  geom_point(aes(x = x1, y = x2, colour = clust)) +
  geom_point(aes(x = x1, y = x2), data = as.data.frame(fitkm$centers),
             size = 5, col = "red", alpha = 0.8) +
  helpIAmColourblind

K = 5

K = 2